Abstract. An increasing amount of data is published on the Web according to
the Linked Open Data (LOD) principles. End users would like to browse these
data in a flexible manner. In this paper we focus on similarity-based browsing
and we introduce a novel method for computing the similarity between two
entities of a given RDF/S graph. The distinctive characteristics of the
proposed metric are that it is generic (it can be used to compare nodes of any
kind), it takes into account the neighborhoods of the nodes, and it is
configurable (with respect to the accuracy vs. computational complexity
tradeoff). We demonstrate the behavior of the metric using examples from an
application over LOD. Finally, we generalize and elaborate on implementation
approaches harmonized with the distributed nature of LOD, which can be used for
computing the most similar entities using neighborhood-based similarity metrics.



1 Introduction 

In recent years a vast amount of structured data has been published as Linked
Open Data (LOD). However, in their current form, these data cannot be directly
exploited by end users, since better linking, browsing and presentation are
required (interaction and interfaces are among the main research challenges of
LOD according to [1]). Our objective is to investigate generic methods for
browsing and exploring such data sets. The context and motivation for our work
was the design and development of an online movie exploration system based on
Semantic Web technologies, whose data are fetched dynamically from the LOD
cloud, and which offers similarity-based browsing for bypassing the need for
query formulation by end users.

In this paper, we motivate the need for similarity-based browsing, we identify
related requirements, and we introduce a new similarity function for tackling
them. In brief, the proposed similarity between two RDF nodes is the Jaccard
similarity coefficient evaluated over the nodes of the extended (radius-bounded)
neighborhoods (containing both instance and schema nodes) of the compared
nodes. A distinctive characteristic of this metric is that each node that
participates in an intersection or union operation of the Jaccard similarity
coefficient is weighted by a value based on its path distance from the compared
nodes, for promoting close matches over distant ones. In a nutshell, the
distinctive characteristics of the proposed similarity metric are that: (a) it
is type independent (it can compute similarity between any pair of resources),
(b) it can be applied within a single KB (thus it differs from the methods
which have been proposed for ontology matching), and (c) it offers the designer
(or end user) the flexibility to choose the appropriate depth depending on
their needs (accuracy or computational complexity). Subsequently, we describe
implementation approaches for computing the most similar entities, focusing on
approaches which are harmonized with the distributed nature of LOD. In
particular we show how a similarity function can be reversed for enabling the
computation of similar pages over the LOD without having to access the entire
corpus. Such methods can be used not only for the introduced similarity metric,
but for neighborhood-based similarity metrics in general.

The rest of this paper is organized as follows. Section [2] describes the
motivation and application context of our work. Section [3] discusses related
works. Section [4] introduces the minimal set of symbols and notations required
for defining the similarity function. Section [5] introduces the similarity
function and Section [6] demonstrates its merits over the running example.
Section [7] discusses implementation approaches and shows how a similarity
function can be reversed. Finally, Section [8] concludes the paper and
identifies issues for further research.

2 Application Context 

The context of our work is an application over the Linked Open Data (LOD) 
cloud. Our objective was to design and develop a system which allows the flexible 
exploration of movie information, based on information fetched from the LOD 
cloud. The distinctive characteristics of this system, called MovieSim, are: 

— All information is fetched from the LOD cloud. This not only automates
information updating, but also enables the application to always provide
up-to-date information.

— It links the structured information that is available in the LOD, and
enriches it with links to external information (plain Web pages).

Specifically, the data from LinkedMDB³ are fetched in RDF format from its
SPARQL endpoint, while the data from the Freebase data cloud are fetched in
JSON format through its API. Linking the data extracted from the two sources
posed no difficulty, since LinkedMDB provides, for each of its entities, a link
to the unique identifier that represents the entity in Freebase's data cloud.

Since most end users do not have the technical knowledge (or the willingness) 
to formulate explicit SPARQL queries, MovieSim provides a more user friendly 
interaction, namely (a) keyword-based retrieval and (b) similarity-based browsing. 

To support keyword-based retrieval, MovieSim periodically fetches information
from LinkedMDB and indexes it with the help of LARQ (Lucene+ARQ). The
availability of an index makes the evaluation of keyword queries very fast. We
will not describe this functionality in detail since keyword searching over
structured data is not the focus of this paper.

3 http://www.linkedmdb.org

Similarity-based browsing aims at allowing users to explore the available infor- 
mation without having to formulate structured queries. Note that similarity-based 
browsing is mainly offered for browsing image and video databases (e.g. [S]), but 
(to the best of our knowledge) has not been applied over RDF data. 

Regarding the presentation of information, MovieSim supports various kinds of 
Web pages, each one having a different role. Keyword search is supported through 
a search box, while the results of the query are viewed by a different kind of page. 
The essential category of pages contains page types for showing information about: 

— actors, 

— directors, 

— editors, 

— movies, and 

— writers. 

Each page type presents information which is dynamically fetched and linked. 
In addition, the system provides a general purpose page type to show information 
about entity types that do not fall in one of the previous categories. Below we 
present the information that we fetch for each supported type, from each individual 
source. 



Movie
  Attribute              Source
  Title                  LinkedMDB
  Runtime                LinkedMDB
  Initial Release Date   LinkedMDB
  Movie Actors           LinkedMDB
  Movie Writers          LinkedMDB
  Movie Directors        LinkedMDB
  Movie Editors          LinkedMDB
  Image                  Freebase
  Abstract               Freebase
  Rating                 Freebase
  Tagline                Freebase
  Genres                 Freebase

Writer
  Attribute              Source
  Writer Name            LinkedMDB
  Films Written          LinkedMDB
  Image                  Freebase
  Abstract               Freebase
  Birth Date             Freebase
  Birth Place            Freebase
  Nationality            Freebase

Actor
  Attribute              Source
  Actor Name             LinkedMDB
  Films Acted            LinkedMDB
  Image                  Freebase
  Abstract               Freebase
  Birth Date             Freebase
  Birth Place            Freebase
  Nationality            Freebase

Editor
  Attribute              Source
  Editor Name            LinkedMDB
  Films Edited           LinkedMDB
  Image                  Freebase
  Abstract               Freebase
  Birth Date             Freebase
  Birth Place            Freebase
  Nationality            Freebase

Director
  Attribute              Source
  Director Name          LinkedMDB
  Films Directed         LinkedMDB
  Image                  Freebase
  Abstract               Freebase
  Birth Date             Freebase
  Birth Place            Freebase
  Nationality            Freebase

General
  Attribute              Source
  Title                  LinkedMDB
  Inbound Links          LinkedMDB
  Outbound Links         LinkedMDB
  Image                  Freebase
  Abstract               Freebase


While viewing the page of one entity, the user can continue browsing and
exploring similar entities. The similar entities are computed using the
similarity function that we will describe later on. Since the similar entities
can be numerous and of different types, only the entities with the highest
similarity should be suggested. Figure [1] shows a screenshot of the Web page
produced for the movie The Da Vinci Code.



Fig. 1. Movie Page (MovieSim screenshot for The Da Vinci Code)



Note that similarity-based browsing is actually an alternative (essentially com- 
plementary) approach to the facet-based browsing [55], which is supported by sys- 
tems like: BrowseRdf [24], Humboldt, VisiNav Q2], Longwell [25], Ontogator gS], 
/facet [14], Camelis2 [11]. Facet-based browsing also bypasses the query
formulation effort. However, similarity-based browsing does not require the
user to select the relationship through which two entities are related.
Instead, the similarity value quantifies several relationships (direct or path
based) and offers an aggregated form of relevance.

Similarity-based browsing can actually be offered in the context of a facet- 
based browsing system. Specifically, a new facet can be defined which shows the 
most similar entities. 

Figure [2] sketches the architecture of MovieSim. Its architecture is based on
the MVC (Model View Controller) pattern: all business logic is implemented in
Servlets, and all communication and data transfer issues are handled using Java
Beans (one for each entity type mentioned earlier). The presentation of data
(page types) is specified using JSP pages in order to separate the presentation
design from the application logic, which makes the system easier to extend and
modify.

Fig. 2. The Architecture of MovieSim (components shown: MovieSim, Lucene+ARQ,
and the LOD cloud with LinkedMDB and Freebase)



3 Related Work on Similarity over RDF/S 



Since we focus on similarity-based browsing, in this section we briefly review
the related work. In general, with the rapid development of the Semantic Web,
there has been an increased interest in developing methods for finding
similarities between nodes in RDF/S graphs. Most related works address the
problem of ontology matching. Below we list and briefly comment on the most
closely related ones.

[29] presents a method for computing the similarity between two entities coming
from two different OWL DL ontologies. The computation of similarity is based on
the extraction of information encoded in each entity's description. The
extracted components are then compared, taking into account the predefined
meanings of OWL DL and RDF(S) primitives, to produce partial component
similarity values, which are then combined using predefined weights under a
variable weighting scheme.

[7] also proposes a similarity function for entity matching between different
OWL ontologies.

There are also algorithms (again for the problem of ontology matching) which 
use the edit distance to find the lexical similarity between two entities, such as the 
MLMA+ algorithm [5] which, amongst other measures, makes use of the Leven- 
shtein (Edit) distance [15]. 

Another algorithm (for ontology matching) is presented in [T] for finding sim- 
ilarities between two entities, of some given ontologies based on the combination 
of structural and lexical information provided by the ontology, which is divided 
into three stages. In the first stage each entity is lexically analyzed, based on in- 
formation given from their labels and descriptions. The second stage involves the 
comparison of the entities based on the structure of the graph, while the third 
stage combines the results of the two previous stages and produces a final result 
that represents the similarity between the two entities. 

Another related work, aiming at identifying cases where the same objects are
identified by different URIs in different datasets in the context of LOD, is
[22]. Finally, [27] proposes a metric for entity comparison in hierarchical
ontologies (however, that work exploits only hierarchical relationships and
ignores properties).

A problem similar in spirit is blank node matching, which aims at defining
a mapping between the blank nodes of two KBs (related works include PromptDiff
[53], Ontoview [IS], CWM [3J, RDFSync [3D]). 

To summarize, most of the related works aim at finding similarities between
entities of different knowledge bases. Therefore they mainly identify
similarities between entities of the same type. Such approaches would not be
convenient for our system, since we would have to design several class-specific
similarity functions, i.e. similarity functions between movies and actors,
directors and actors, writers and movies, and so on. For this reason, we
decided to move towards a similarity computation method that is
type-independent, allowing the comparison of entities of the same or different
types. Finally, we should note that the similarity function that we needed for
our system, apart from being type-independent, should exploit both the instance
and the schema layer (for being able to compute similarities between entities
which do not belong to the same classes).

4 Background (RDF definitions and notations) 

An RDF Knowledge Base (KB) is defined as a set of RDF triples, denoted by $K$,
each having the form (subject, predicate, object), for short $(s,p,o)$. A KB
$K$ can also be viewed as a directed labeled graph $G = (N, E)$. The nodes of
the graph are the URIs, the literals and the blank nodes that appear in the
triples of $K$, while the edges of the graph are labeled arcs that connect the
corresponding nodes.

We shall use as running example the KB that is illustrated in Figure [3]. For
the sake of completeness, even though the LOD dataset did not have an
explicitly defined schema, we have created one (for capturing the general case
of RDF/S KBs). Furthermore, we added some extra entities⁶ apart from those
fetched from LOD.

All resources which are instances of a class are vertically aligned with the
class. Below we introduce some notations which are necessary for defining the
similarity metric.

We shall use $Pr$ to refer to the properties that occur in $K$. For a given
resource $u$ we shall use ResFrom(u) (resp. ResTo(u)) to denote the resources
which are pointed to by (resp. point to) resource $u$, i.e.

$$\mathit{ResFrom}(u) = \{\, o \mid (u,p,o) \in K,\ p \in Pr \,\}$$
$$\mathit{ResTo}(u) = \{\, o \mid (o,p,u) \in K,\ p \in Pr \,\}$$

In our running example we have:

ResFrom(SherlockHolmes) = {England, GuyRitchie, JudeLaw, Mystery, SherlockHolmesB…

We define the classes and the superclasses of a resource $u$ as:

$$\mathit{Classes}(u) = \{\, c \mid (u,\mathit{type},c) \in K \,\}$$
$$\mathit{SuperClasses}(u) = \{\, c \mid (u,\mathit{subClassOf},c) \in K \,\}$$

6 Specifically DaVinciCode Book, Illuminati Book, Sherlock Holmes Book, Dan
Brown, and Conan Doyle.

Fig. 3. The RDF graph G of our running example



For example, in Figure [3] we have:

Classes(IlluminatiBook) = {MysteryNovel} while SuperClasses(MysteryNovel) =
{Novel}. Obviously, if an element $x$ is a class then Classes(x) = ∅, while if
$x$ is an instance of a class then SuperClasses(x) = ∅.
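
The notations above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in MovieSim: the KB is held as an in-memory set of triples, $Pr$ is taken to be every predicate other than type and subClassOf, and the property names `writtenBy` and `basedOn` are hypothetical.

```python
TYPE, SUBCLASS = "type", "subClassOf"

# Hypothetical toy KB as a set of (subject, predicate, object) triples.
# The property names "writtenBy" and "basedOn" are made up for illustration.
K = {
    ("IlluminatiBook", TYPE, "MysteryNovel"),
    ("MysteryNovel", SUBCLASS, "Novel"),
    ("IlluminatiBook", "writtenBy", "DanBrown"),
    ("Illuminati", "basedOn", "IlluminatiBook"),
}

def is_property(p):
    # Assumption: Pr holds every predicate except type and subClassOf.
    return p not in (TYPE, SUBCLASS)

def res_from(K, u):
    """Resources that u points to through a property in Pr."""
    return {o for (s, p, o) in K if s == u and is_property(p)}

def res_to(K, u):
    """Resources that point to u through a property in Pr."""
    return {s for (s, p, o) in K if o == u and is_property(p)}

def classes(K, u):
    """Classes that u is an instance of."""
    return {o for (s, p, o) in K if s == u and p == TYPE}

def super_classes(K, u):
    """Direct superclasses of the class u."""
    return {o for (s, p, o) in K if s == u and p == SUBCLASS}

print(classes(K, "IlluminatiBook"))      # {'MysteryNovel'}
print(super_classes(K, "MysteryNovel"))  # {'Novel'}
```

A linear scan over all triples is enough for a toy KB; a real system would more plausibly use an index by subject or a SPARQL endpoint.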

Some notations for edges follow. We define the set of classification and
inheritance links of a resource $u$ and a class $c$ as:

$$\mathit{ClassLinks}(u) = \{\, (u,c) \mid (u,\mathit{type},c) \in K \,\}$$
$$\mathit{SupLinks}(c) = \{\, (c,c') \mid (c,\mathit{subClassOf},c') \in K \,\}$$

The inbound and outbound property links of a resource $u$ are defined as:

$$\mathit{PropsFromLinks}(u) = \{\, (u,o) \mid (u,p,o) \in K,\ p \in Pr \,\}$$
$$\mathit{PropsToLinks}(u) = \{\, (o,u) \mid (o,p,u) \in K,\ p \in Pr \,\}$$

Now we extend the above definitions to take as parameter a set $S$ of
resources, so we have:

$$\mathit{ResFrom}(S) = \bigcup_{u \in S} \mathit{ResFrom}(u)$$
$$\mathit{ResTo}(S) = \bigcup_{u \in S} \mathit{ResTo}(u)$$
$$\mathit{PropsFromLinks}(S) = \bigcup_{u \in S} \mathit{PropsFromLinks}(u)$$
$$\mathit{PropsToLinks}(S) = \bigcup_{u \in S} \mathit{PropsToLinks}(u)$$
$$\mathit{Classes}(S) = \bigcup_{u \in S} \mathit{Classes}(u)$$
$$\mathit{SuperClasses}(S) = \bigcup_{u \in S} \mathit{SuperClasses}(u)$$
$$\mathit{ClassLinks}(S) = \bigcup_{u \in S} \mathit{ClassLinks}(u)$$
$$\mathit{SupLinks}(S) = \bigcup_{u \in S} \mathit{SupLinks}(u)$$

A path over $G$ is any sequence of edges of the form $(A, P, C), (C, P', D),
\ldots, (E, P'', u)$, where all predicates $(P, P', \ldots, P'')$ are either
properties in $Pr$ or the predicate type or the predicate subClassOf.

We define the distance between two nodes $A$ and $B$ over $G$, denoted by
$dist_G(A, B)$, as the length of the shortest path from $A$ to $B$. If no path
exists then the distance is assumed to be infinite.

5 Similarity Function 

In this section, we will introduce and analyze, step by step, the proposed
similarity metric, over the running example of Fig. [3]. Suppose we want to
compute the similarity between two nodes $A$ and $B$ of the RDF graph $G$. At
first we define the subgraphs of $A$ and $B$ of radius $k$, denoted by:

$$g_A(k) = (N_k(A), E_k(A))$$
$$g_B(k) = (N_k(B), E_k(B))$$

They consist of all nodes and edges that are visited if we start from $A$ and
$B$ respectively, and traverse all links (properties, type, subClassOf) for
depth up to $k$, where the value of $k$ is configured externally (it will be
discussed later on).

These graphs can be computed in an iterative manner. For instance, for defining
$g_A(k)$ we start from $g_A(0) = (N_0(A), E_0(A))$ where $N_0(A) = \{A\}$ and
$E_0(A) = \emptyset$. Subsequently, from $g_A(i-1) = (N_{i-1}(A), E_{i-1}(A))$
we can compute $g_A(i) = (N_i(A), E_i(A))$ (for all $1 \le i \le k$), as
follows:

$$N_i(A) = N_{i-1}(A) \cup \mathit{ResFrom}(N_{i-1}(A)) \cup \mathit{Classes}(N_{i-1}(A)) \cup \mathit{SuperClasses}(N_{i-1}(A))$$
$$E_i(A) = E_{i-1}(A) \cup \mathit{PropsFromLinks}(N_{i-1}(A)) \cup \mathit{ClassLinks}(N_{i-1}(A)) \cup \mathit{SupLinks}(N_{i-1}(A))$$

Each step of the iteration enriches the current set of nodes $N_{i-1}(A)$ with:

— the classes of the nodes in $N_{i-1}(A)$ (since classes carry important
information),

— the values of the properties that start from the nodes in $N_{i-1}(A)$ (they
are actually attribute values),

— the superclasses of the nodes in $N_{i-1}(A)$ (for climbing up the subClassOf
hierarchy).

The iterative expansion allows collecting values of complex attributes, as well
as higher level superclasses (in this way we can detect similarities even
between very "distant" entities which belong to different class hierarchies).
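
The iterative expansion can be sketched as a bounded breadth-first traversal that also records, for each node, the step at which it was first reached (its distance from the start node). This is a minimal sketch under assumptions: the KB is an in-memory set of triples, only forward edges are followed (as in the policy described above), and the property name `starring` is hypothetical.

```python
def neighbourhood(K, a, k):
    """Return {node: distance from a} for every node of N_k(a)."""
    dist = {a: 0}      # N_0(a) = {a}
    frontier = {a}     # nodes discovered in the previous step
    for depth in range(1, k + 1):
        # Follow every outgoing edge (property, type, subClassOf) one step.
        reached = {o for (s, p, o) in K if s in frontier}
        # Keep only nodes not seen before, so the first visit fixes the distance.
        frontier = reached - set(dist)
        for n in frontier:
            dist[n] = depth
    return dist

# Hypothetical toy KB; "starring" is a made-up property name.
K = {
    ("DaVinciCode", "type", "Film"),
    ("DaVinciCode", "starring", "TomHanks"),
    ("TomHanks", "type", "Actor"),
    ("Actor", "subClassOf", "Person"),
    ("Illuminati", "starring", "TomHanks"),
}

print(neighbourhood(K, "DaVinciCode", 1))
```

Expanding only from the newly discovered frontier yields the same node set as expanding from all of $N_{i-1}(A)$, because older nodes were already expanded in earlier steps. Note that `Illuminati` is never reached from `DaVinciCode`, since inbound edges are not followed.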

We should stress at this point that one could adopt a different policy
regarding how a subgraph expands. For instance, one could also expand the graph
using properties which point to the current set of nodes (in that case
$ResTo(N_{i-1}(A))$ would be added to $N_i(A)$ and $PropsToLinks(N_{i-1}(A))$
to $E_i(A)$). The decision is application or ontology specific. [16,17] have
also made the observation that it is often not enough to use a single
similarity measure to achieve good results, therefore a combination of features
needs to be engineered or even learned. In our case we decided to take only the
forward property direction, since in most cases a property is more important
for its origin than for its destination.

To better illustrate the construction of the subgraph, consider the graph $G$
of Figure [3] and suppose that A = DaVinci Code and B = Illuminati. The
subgraphs $g_A(3)$ and $g_B(3)$ are shown in Figure [4] and Figure [5]
respectively (the latter depicts all subgraphs for k = 0 to k = 3).

Table [1] shows the distances $dist_{g_A}(A,u)$ and $dist_{g_B}(B,u)$ for
various nodes $u$. The nodes for which both $dist_{g_A}(A,u)$ and
$dist_{g_B}(B,u)$ are defined (i.e. both are different from ∞) belong to the
intersection of the nodes of the two subgraphs, while the rest are nodes that
belong to only one of the subgraphs.

After having constructed the graphs $g_A$ and $g_B$, one could compute the
similarity between $A$ and $B$ by applying the Jaccard similarity coefficient
[15] over their node sets, i.e. between $N_k(A)$ and $N_k(B)$, as follows:

$$sim_k(A,B) = \frac{|N_k(A) \cap N_k(B)|}{|N_k(A) \cup N_k(B)|} \qquad (1)$$

In our example the intersection between $N_3(A)$ and $N_3(B)$ is illustrated
(vertically aligned) at the center of Figure [6], where for reasons of space we
do not show the schema level intersections.

Note that by considering the nodes at depth greater than 1, we can identify
similarities between resources of different types. If resources of different
types are compared (e.g. a film with an actor), they will rarely have the same
properties at small depth (e.g. for k = 1) and therefore we will not get many
(or any) intersecting nodes.
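
The effect of depth on the unweighted coefficient of equation (1) can be illustrated with a small sketch. It is only an illustration under assumptions: the toy graph, the node names and the property names `p` and `q` are hypothetical, and each start node is taken to belong to its own neighborhood (following $N_0(A) = \{A\}$).

```python
def neighbourhood_nodes(K, a, k):
    """Nodes reachable from a in at most k forward steps (a itself included)."""
    nodes, frontier = {a}, {a}
    for _ in range(k):
        frontier = {o for (s, p, o) in K if s in frontier} - nodes
        nodes |= frontier
    return nodes

def jaccard_sim(K, a, b, k):
    """Unweighted Jaccard coefficient over the two radius-k node sets."""
    na = neighbourhood_nodes(K, a, k)
    nb = neighbourhood_nodes(K, b, k)
    return len(na & nb) / len(na | nb)

# Hypothetical toy KB: a and b share no direct value, but their values
# v1 and v2 are instances of the same class C1.
K = {
    ("a", "p", "v1"),
    ("b", "q", "v2"),
    ("v1", "type", "C1"),
    ("v2", "type", "C1"),
}

print(jaccard_sim(K, "a", "b", 1))  # 0.0
print(jaccard_sim(K, "a", "b", 2))  # 0.2
```

At depth 1 the two node sets are disjoint, while at depth 2 the shared class `C1` makes the similarity positive (1 common node out of 5).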

Obviously the similarity value obtained depends on the value of $k$. For
example, for k = 1 we get:

$$sim_1(DaVinciCode, Illuminati) = 0.26$$

while for k = 3 we get:

$$sim_3(DaVinciCode, Illuminati) = 0.61$$

Fig. 5. $g_B(3)$ where B = Illuminati

Fig. 6. Intersection between the Illuminati and DaVinci Code subgraphs

However, a shortcoming of this approach is that a common node spotted at depth
1 is weighted equally to a common node at a larger distance. For this reason,
below we introduce a different similarity function which takes into account the
values $dist_{g_A}(A,u)$ and $dist_{g_B}(B,u)$. We should clarify that this
extension does not increase the computational cost of the similarity function,
since these distances are computed anyway during the construction of the
subgraphs $g_A(k)$ and $g_B(k)$.

To understand the extension we shall first express function (1) in a different,
but equivalent, manner:

$$sim_k(A,B) = \frac{\sum_{n \in (N_k(A) \cap N_k(B))} 1}{\sum_{n \in (N_k(A) \cup N_k(B))} 1} \qquad (2)$$

This form makes evident that each element in the intersection or union
contributes the value of one. Now we will introduce the new formula in which
each element in the intersection or union does not contribute the value of one,
but a value based on its average distance from nodes $A$ and $B$.

Since the closest node is at distance 1 while the most distant is at distance
$k$ (or infinite), we shall use the expression $k + 1 - dist$, which gives the
closest nodes a contribution equal to $k$ and the most distant nodes a
contribution equal to 1. If a distance equals ∞ we consider it as $k + 1$. In
this way the expression $k + 1 - dist$ yields zero⁷.

7 This means that the cells of Table [1] that have an infinite value (∞) are
actually considered to have the value k + 1, i.e. 4.

u                    dist_{g_A}(A,u)   dist_{g_B}(B,u)
-------------------  ----------------  ----------------
Genre                2                 2
Actor                2                 2
Film                 1                 1
Director             2                 2
Location             2                 2
Novel                3                 3
Mystery Novel        2                 2
Writer               2                 2
Mystery              1                 1
Ian McKellen         1                 ∞
Carnelutti           1                 ∞
Tom Hanks            1                 1
Victor Alfiery       ∞                 1
Ewan McGregor        ∞                 1
Ron Howard           1                 1
Italy                2                 1
Scotland             ∞                 2
England              2                 ∞
DaVinci Code Book    1                 ∞
Illuminati Book      ∞                 1
Dan Brown            2                 2

Table 1. Distances from A and from B



The proposed similarity function is defined as:

$$sim_k(A,B) = \frac{\sum_{n \in (N_k(A) \cap N_k(B))} \frac{(k' - dist_{g_A}(A,n)) + (k' - dist_{g_B}(B,n))}{2}}{\sum_{n \in (N_k(A) \cup N_k(B))} \frac{(k' - dist_{g_A}(A,n)) + (k' - dist_{g_B}(B,n))}{2}} \qquad (3)$$

where $k' = k + 1$.

If we apply it to our running example we now get:

$$sim_3(DaVinciCode, Illuminati) = 0.7$$

In brief, the proposed similarity between two nodes A and B is actually the 
Jaccard similarity coefficient evaluated over the nodes of the extended neighbor- 
hoods of the compared nodes. Each node of the neighborhoods is weighted so that 
the nodes closer to the compared nodes get a greater weight than the distant ones. 
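
A minimal sketch of the distance-weighted coefficient follows. It is a literal reading of the formula under stated assumptions, not the authors' code: the KB is an in-memory set of triples, each compared node belongs to its own neighborhood at distance 0, a node missing from one neighborhood is given the distance $k' = k + 1$, and the toy graph with properties `p`, `q` and `r` is hypothetical.

```python
def distances(K, a, k):
    """Return {node: distance from a} for the radius-k neighbourhood of a."""
    dist, frontier = {a: 0}, {a}
    for depth in range(1, k + 1):
        frontier = {o for (s, p, o) in K if s in frontier} - set(dist)
        for n in frontier:
            dist[n] = depth
    return dist

def weighted_sim(K, a, b, k):
    """Jaccard coefficient where each node is weighted by its closeness."""
    da, db = distances(K, a, k), distances(K, b, k)
    kp = k + 1  # k' in the text; an infinite distance is treated as k'

    def weight(n):
        return ((kp - da.get(n, kp)) + (kp - db.get(n, kp))) / 2

    numerator = sum(weight(n) for n in da.keys() & db.keys())
    denominator = sum(weight(n) for n in da.keys() | db.keys())
    return numerator / denominator

# Hypothetical toy KB: a and b share only the class C1 at depth 2.
K = {
    ("a", "p", "v1"),
    ("b", "q", "v2"),
    ("v1", "type", "C1"),
    ("v2", "type", "C1"),
}
# Same KB plus a value e shared directly (at depth 1) by a and b.
K2 = K | {("a", "r", "e"), ("b", "r", "e")}

print(weighted_sim(K, "a", "b", 2))   # 1/6: only the distant node C1 matches
print(weighted_sim(K2, "a", "b", 2))  # 0.375: the close match e counts more
```

In the second KB the direct common value `e` contributes a weight of 2, whereas the common class `C1` at depth 2 contributes only 1, which is the promotion of close matches described above.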



5.1 Properties of the Similarity Function 

For any resource $u$, and for any positive integer $k$, it holds that
$sim_k(u, u) = 1$.

It is also clear that the metric is symmetric, i.e. $sim_k(a,b) = sim_k(b,a)$.

Although in the examples that we have seen earlier it happens to hold that if
$m > m'$ then $sim_m(a,b) > sim_{m'}(a,b)$, in the general case this does not
hold. The reason is that for a high $k$ we may have several non-intersecting
sets of nodes which increase the denominator of the similarity function.
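
These properties can be checked with a short worked calculation. The graph used for the second part is hypothetical: $a$ and $b$ share one direct value $e$; in addition $a$ points to $x$, which points to $x_2$ and $x_3$, while $b$ points to $y$, which points to $y_2$ and $y_3$. The calculation assumes that each compared node belongs to its own neighborhood at distance 0 and that an infinite distance is counted as $k'$.

```latex
\begin{align*}
% Identity: for A = B = u the two node sets coincide,
% so numerator and denominator are the same sum.
N_k(u) \cap N_k(u) &= N_k(u) \cup N_k(u) = N_k(u)
  \;\Longrightarrow\; sim_k(u,u) = 1 \\[4pt]
% k = 1 (k' = 2): weights a = b = 1, e = 1, x = y = 1/2
sim_1(a,b) &= \frac{1}{1 + 1 + 1 + \tfrac{1}{2} + \tfrac{1}{2}}
            = \frac{1}{4} = 0.25 \\[4pt]
% k = 2 (k' = 3): weights a = b = 3/2, e = 2, x = y = 1,
% and x_2 = x_3 = y_2 = y_3 = 1/2
sim_2(a,b) &= \frac{2}{\tfrac{3}{2} + \tfrac{3}{2} + 2 + 1 + 1 + 4 \cdot \tfrac{1}{2}}
            = \frac{2}{9} \approx 0.22
\end{align*}
```

Here the four non-shared nodes that appear at depth 2 enlarge the denominator more than the larger weight of $e$ enlarges the numerator, so $sim_2(a,b) < sim_1(a,b)$.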
6 Examples and Analysis
6.1 Behavior

Table [2] shows the computed similarities between the films DaVinci Code,
Illuminati and Sherlock Holmes, for k = 1, 2, 3. We observe that the movie most
similar to DaVinci Code is Illuminati (and not Sherlock Holmes) for all values
of k from 1 to 3.



k    sim_k(DaVinciCode, Illuminati)   sim_k(DaVinciCode, SherlockHolmes)
---  -------------------------------  -----------------------------------
1    0.53                             0.30
2    0.67                             0.54
3    0.70                             0.58

Table 2. Similarity for different values of k



Let us now use some examples to justify the benefits of k values higher than 1,
and to better understand the behavior of the similarity function. Table [3]
shows the computed similarities between the nodes A, B, C and D, for
k = 1...3, for the example shown in Figure [7](I). We observe that for k = 1,
B is the most similar to A since they are under the same class, while the
similarity of A with C and D is zero. However, for k = 2 the similarity of A
with C and D is not zero, and C is more similar than D.

To demonstrate the potential of the similarity function to exploit
commonalities in property paths, Table [4] shows the computed similarities
between the nodes A, B, C and D, for k = 1...3, for the example shown in
Figure [7](II). We observe that for k = 2 A is more similar to C than to D
because, even though they do not have any direct value in common, v1 and v2 are
under the same class C1, and v4 is a common value at depth 2. Notice that the
similarity between A and D is not zero for k = 2, due to the value v4.

k    sim_k(A,B)   sim_k(A,C)   sim_k(A,D)
---  -----------  -----------  -----------
1    1            0            0
2    1            0.60         0.33
3    1            0.625        0.40

Table 3. Similarity for different values of k over Fig. [7](I)



k    sim_k(A,B)   sim_k(A,C)   sim_k(A,D)
---  -----------  -----------  -----------
1    1            0            0
2    1            0.50         0.25
3    1            0.57         0.28

Table 4. Similarity for different values of k over Fig. [7](II)



It is also worth noting that the most similar entity can change as k changes.
For instance, in the example of Figure [7](III), as we can see from Table [5],
for k = 1 the most similar to A is the entity C, while for k = 2 (and higher)
the most similar to A is the entity B.

k    sim_k(A,B)   sim_k(A,C)
---  -----------  -----------
1    0            0.40
2    0.60         0.46
3    0.625        0.47

Table 5. Similarity for different values of k over Figure [7](III)



6.2 Computational Complexity 

Let $d$ be the average number of edges which are adjacent to a node. For a node
$A$, the number of nodes in the graph $g_A(k)$ is at most in $O(d^k)$. This is
therefore the cost of $sim_k(\cdot,\cdot)$.

6.3 On Selecting a value for k 

One issue that plays an important role in the computation of similarity is the 
choice of the appropriate k. The choice can be made by the application designer 
(or even by the end user at run-time). By choosing a greater k more complexity 
is added to the computation of the similarity and this is the cost to pay for more 
accurate results in the sense that a wider part of the graph is taken into account. 
By choosing a lower k the computational cost gets decreased, but the results may 
not be as accurate as the user would like. 

One method for selecting a k is to measure graph features of the RDF/S graph,
e.g. the diameter of the graph.

6.4 Variations of the Similarity Function

As one may have noticed, the similarity function ignores the names of the
properties.

The benefit of this choice is that the function can yield positive similarities
also between objects that use different properties. For example, consider the
triples (a, hasFriend, e) and (b, worksFor, e). The similarity function will
return a positive value for $sim_1(a,b)$ although these entities have different
properties. It would be zero if the property names were taken into account.
However, the shortcoming is the inability to also promote matches at the level
of properties. For example, if we had another triple (c, hasFriend, e) then we
would have $sim_1(a,c) = sim_1(a,b)$, although we would prefer
$sim_1(a,c) > sim_1(a,b)$.

If we wanted to take into account the property names then we could prefix the
nodes of the subgraphs which are reached from properties by the corresponding
property name. In particular, instead of

$$\mathit{ResFrom}(u) = \{\, o \mid (u,p,o) \in K,\ p \in Pr \,\}$$

we could define

$$\mathit{ResFrom}'(u) = \{\, p{:}o \mid (u,p,o) \in K,\ p \in Pr \,\}$$

where "p:o" is treated as one string. Clearly, with such a change, the new
similarity function, denoted by $sim'$, would yield
$sim'_1(a,c) > sim'_1(a,b) = 0$.

One approach to reconcile the two approaches is to change the graph expansion
step so that both ResFrom(u) and ResFrom'(u) are used for the definition of the
nodes of the subgraphs. Specifically, $N_i(A)$ can now be defined as:

$$N_i(A) = N_{i-1}(A) \cup \mathit{ResFrom}(N_{i-1}(A)) \cup \mathit{ResFrom}'(N_{i-1}(A)) \cup \mathit{Classes}(N_{i-1}(A)) \cup \mathit{SuperClasses}(N_{i-1}(A))$$

In this way we will get $sim''_1(a,c) > sim''_1(a,b) > 0$.
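
The three variations can be compared on the hasFriend/worksFor example with a small sketch. It is a simplification under stated assumptions: only depth 1 is considered, the unweighted Jaccard coefficient is used instead of the weighted one (at depth 1 the ordering of the results is the same), and the `mode` parameter and function names are hypothetical.

```python
def depth1_nodes(K, u, mode):
    """Depth-1 node set of u; mode is 'plain', 'prefixed' or 'both'."""
    nodes = {u}
    for (s, p, o) in K:
        if s != u:
            continue
        if p in ("type", "subClassOf"):
            nodes.add(o)              # schema links are never prefixed
            continue
        if mode in ("plain", "both"):
            nodes.add(o)              # ResFrom: the value alone
        if mode in ("prefixed", "both"):
            nodes.add(f"{p}:{o}")     # ResFrom': "p:o" as one string
    return nodes

def sim1(K, a, b, mode):
    """Unweighted depth-1 Jaccard coefficient under the chosen variation."""
    na, nb = depth1_nodes(K, a, mode), depth1_nodes(K, b, mode)
    return len(na & nb) / len(na | nb)

# The triples used in the text.
K = {
    ("a", "hasFriend", "e"),
    ("b", "worksFor", "e"),
    ("c", "hasFriend", "e"),
}

for mode in ("plain", "prefixed", "both"):
    print(mode, sim1(K, "a", "c", mode), sim1(K, "a", "b", mode))
```

With `plain` the two similarities are equal and positive, with `prefixed` the pair (a, b) drops to zero, and with `both` the pair (a, c) scores higher than (a, b) while both stay positive, matching the three cases discussed above.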

6.5 Experimental Results 

We created a bigger KB for testing the similarity function, i.e. for judging whether 
it returns intuitive results and for investigating how the value of k affects the 
results. 

[Setup of the KB]

Our measurements were based on a KB that we created by extracting data from
LinkedMDB, through Virtuoso's SPARQL Endpoint, with explicit queries. More
specifically, we selected and downloaded 10 entities that were quite relevant
to each other. For each one of them we expanded its subgraph to depth 3, and
with the fetched information we created a KB on which our measurements were
conducted. The entities that were chosen and their types are shown in Table [6].

The resulting KB contained: 16 classes, 70 properties, 3326 resources, 4301
property instances, and 4877 triples in total.
Entity                 Type
---------------------  ---------
Angels and Demons      Film
The DaVinci Code       Film
That Thing You Do!     Film
Original Sin           Film
Jude                   Film
Catch Me If You Can    Film
Leonardo DiCaprio      Actor
Tom Hanks              Actor
Phil Alden Robinson    Director
Joe Dante              Director

Table 6. Selected (seed) entities

[Top-3 Results]

We computed the similarity between every pair of these 10 entities for all
k = 1, 2, 3. Figure [8] shows the top-3 most similar entities for each entity.
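
Producing a top-3 list for an entity amounts to ranking the remaining candidates by their similarity to it. A minimal sketch follows; the function name `top_k_similar`, the entity labels and the scores are hypothetical placeholders, and any similarity function with the signature `sim(x, y)` could be plugged in.

```python
import heapq

def top_k_similar(entity, candidates, sim, k=3):
    """Return the k candidates most similar to entity, best first."""
    others = [c for c in candidates if c != entity]
    return heapq.nlargest(k, others, key=lambda c: sim(entity, c))

# Made-up similarity scores, used only to exercise the ranking.
scores = {("A", "B"): 0.7, ("A", "C"): 0.2, ("A", "D"): 0.5, ("A", "E"): 0.1}

def sim(x, y):
    # Look the pair up in either order; unknown pairs score 0.
    return scores.get((x, y), scores.get((y, x), 0.0))

print(top_k_similar("A", ["A", "B", "C", "D", "E"], sim))  # ['B', 'D', 'C']
```

Ranking all pairs this way requires one similarity computation per candidate, which is acceptable for a handful of seed entities but would need the more selective approaches discussed later for a large corpus.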



Top-3 most similar entities (in rank order, top to bottom)

Entity               sim_1                  sim_2                  sim_3
-------------------  ---------------------  ---------------------  ---------------------
The Da Vinci Code    Angels and Demons      Angels and Demons      Angels and Demons
                     That Thing You Do!     That Thing You Do!     Catch Me if You Can
                     Catch Me if You Can    Catch Me if You Can    That Thing You Do!

Angels and Demons    The Da Vinci Code      Tom Hanks              Tom Hanks
                     That Thing You Do!     The Da Vinci Code      The Da Vinci Code
                     Catch Me if You Can    That Thing You Do!     That Thing You Do!

Tom Hanks            Leonardo DiCaprio      Angels and Demons      Angels and Demons
                     Phil Alden Robinson    Leonardo DiCaprio      Leonardo DiCaprio
                     Joe Dante              That Thing You Do!     That Thing You Do!

That Thing You Do!   Catch Me if You Can    Catch Me if You Can    Phil Alden Robinson
                     Angels and Demons      Tom Hanks              Tom Hanks
                     The Da Vinci Code      Angels and Demons      Angels and Demons

Original Sin         Jude                   Jude                   Jude
                     Angels and Demons      Angels and Demons      Angels and Demons
                     That Thing You Do!     That Thing You Do!     The Da Vinci Code

Jude                 That Thing You Do!     Angels and Demons      Phil Alden Robinson
                     Angels and Demons      Original Sin           Angels and Demons
                     Original Sin           That Thing You Do!     Original Sin

Catch Me if You Can  That Thing You Do!     That Thing You Do!     Joe Dante
                     The Da Vinci Code      The Da Vinci Code      The Da Vinci Code
                     Angels and Demons      Angels and Demons      That Thing You Do!

Leonardo DiCaprio    Tom Hanks              Tom Hanks              Tom Hanks
                     Phil Alden Robinson    Catch Me if You Can    Angels and Demons
                     Joe Dante              Angels and Demons      Catch Me if You Can

Phil Alden Robinson  Joe Dante              That Thing You Do!     That Thing You Do!
                     Tom Hanks              Catch Me if You Can    Catch Me if You Can
                     Leonardo DiCaprio      Angels and Demons      Jude

Joe Dante            Phil Alden Robinson    Catch Me if You Can    Catch Me if You Can
                     Tom Hanks              Phil Alden Robinson    The Da Vinci Code
                     Leonardo DiCaprio      That Thing You Do!     Phil Alden Robinson

Fig. 8. Comparative Results for sim



We can observe that for some entities, the 3 most similar entities change when
k changes. For example, the 3 most similar entities for Tom Hanks and k = 1 are
(Leonardo DiCaprio, Phil Alden Robinson, Joe Dante), while for k = 2, 3 they are
(Angels and Demons, Leonardo DiCaprio, That Thing You Do!).

Top-3 most similar entities, per entity and per radius k, when using sim':

Entity | sim'_1 | sim'_2 | sim'_3
-------|--------|--------|-------
The Da Vinci Code | (Angels and Demons, That Thing You Do!, Catch Me if You Can) | (Angels and Demons, That Thing You Do!, Catch Me if You Can) | (Angels and Demons, That Thing You Do!, Catch Me if You Can)
Angels and Demons | (The Da Vinci Code, That Thing You Do!, Catch Me if You Can) | (Tom Hanks, The Da Vinci Code, That Thing You Do!) | (Tom Hanks, The Da Vinci Code, That Thing You Do!)
Tom Hanks | (Leonardo DiCaprio, Phil Alden Robinson, Joe Dante) | (Angels and Demons, Leonardo DiCaprio, That Thing You Do!) | (Angels and Demons, Leonardo DiCaprio, That Thing You Do!)
That Thing You Do! | (Catch Me if You Can, The Da Vinci Code, Angels and Demons) | (Catch Me if You Can, Tom Hanks, Angels and Demons) | (Phil Alden Robinson, Tom Hanks, Angels and Demons)
Original Sin | (Jude, Angels and Demons, That Thing You Do!) | (Jude, Angels and Demons, That Thing You Do!) | (Jude, Angels and Demons, The Da Vinci Code)
Jude | (That Thing You Do!, Angels and Demons, Original Sin) | (That Thing You Do!, Angels and Demons, Original Sin) | (Phil Alden Robinson, That Thing You Do!, Angels and Demons)
Catch Me if You Can | (That Thing You Do!, The Da Vinci Code, Angels and Demons) | (That Thing You Do!, The Da Vinci Code, Angels and Demons) | (Joe Dante, The Da Vinci Code, That Thing You Do!)
Leonardo DiCaprio | (Tom Hanks, Phil Alden Robinson, Joe Dante) | (Tom Hanks, Catch Me if You Can, Angels and Demons) | (Tom Hanks, Angels and Demons, Catch Me if You Can)
Phil Alden Robinson | (Joe Dante, Tom Hanks, Leonardo DiCaprio) | (That Thing You Do!, Catch Me if You Can, Angels and Demons) | (That Thing You Do!, Catch Me if You Can, Jude)
Joe Dante | (Phil Alden Robinson, Tom Hanks, Leonardo DiCaprio) | (Catch Me if You Can, Phil Alden Robinson, That Thing You Do!) | (Catch Me if You Can, The Da Vinci Code, Phil Alden Robinson)

Fig. 9. Comparative Results for sim'

We also observed that for k = 1 for some entities we could not get any similar 
entity. Therefore higher values of k are beneficial. 

[Comparison with sim']

In Section 6.4 we described a variation of the similarity function, denoted by
sim'. Fig. 9 shows again the top-3 most similar entities (as in Fig. 8) when
using sim'. We observe that the results are quite similar to those of Fig. 8; in
most cases only the relative ordering of the three most similar entities differs.

[Times] 

The average time to compute $sim_k$ between two randomly selected resources is
3 milliseconds for k = 2 and 32 milliseconds for k = 3. All experiments were
carried out on a computer with an Intel(R) Core(TM)2 Duo processor @2.40GHz and
2 GB RAM, running Microsoft Windows 7 Ultimate.

7 Implementation Approaches 



Here we discuss implementation issues. 
[The Straightforward approach]

One could attempt to compute the similar entities at run time, during the
construction of the page at hand. However, that would not be efficient, in the
sense that a lot of information would have to be fetched and processed. In
particular, to compute the similar entities for an entity A we should compute the
values $sim_k(A, x)$ for all possible resources x. The cost could be reduced by
limiting the set of values that x may take. Specifically, we can first specify the
classes of the possible similar entities, in our case the classes of actors,
directors, editors, movies, writers (as we described in Section 5), and then
download all information available only for these resources. In any case that
would be unacceptably slow and inefficient for large KBs.

[The Single Repository (and Preprocessing) approach] 

An alternative approach is to download and process the entire KB (e.g. as we did 
in the previous section). Since for each entity we need to show only the L (e.g. 
L=5) most similar entities, we can compute offline the L most similar entities for 
each entity of the classes of interest, and then store these L resources (e.g. in main 
memory) for immediate use at run time. Recall that current WSE (Web Search 
Engines) also compute off-line and store for each page the 20 most similar pages. 
This preprocessing can be done offline, before the deployment of the application, 
and it can be periodically redone as new information becomes available at LOD. 
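
The preprocessing step described above can be sketched as follows. This is only a minimal illustration, not the implementation used in our application: the function name `precompute_top_l`, the toy neighborhood sets and the stand-in `jaccard` function are assumptions introduced for the example, and any pairwise similarity function (such as $sim_k$) could be plugged in instead.

```python
import heapq
from typing import Callable, Dict, Hashable, Iterable, List


def precompute_top_l(
    entities: Iterable[Hashable],
    sim: Callable[[Hashable, Hashable], float],
    L: int = 5,
) -> Dict[Hashable, List[Hashable]]:
    """For each entity, keep the L other entities with the highest similarity."""
    items = list(entities)
    table = {}
    for a in items:
        others = (x for x in items if x != a)
        # nlargest avoids sorting the full list when L is small.
        table[a] = heapq.nlargest(L, others, key=lambda x: sim(a, x))
    return table


# Hypothetical neighborhoods standing in for the subgraph nodes of each entity.
neigh = {
    "a": {1, 2, 3},
    "b": {2, 3, 4},
    "c": {3, 9},
    "d": {7, 8},
}


def jaccard(x, y):
    """Stand-in similarity: Jaccard coefficient of the two neighborhoods."""
    union = neigh[x] | neigh[y]
    return len(neigh[x] & neigh[y]) / len(union) if union else 0.0


# Computed once offline; at run time only a dictionary lookup is needed.
top = precompute_top_l(neigh, jaccard, L=2)
```

The cost of this sketch is quadratic in the number of entities, which is why it is restricted to the classes of interest and run before deployment.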

[A Similarity-Reversal approach] 

An alternative and more challenging implementation approach is sketched below.
One could attempt to "reverse" the similarity function, i.e. try traversing the
graph around A and collect those entities which have high chances to be in the
top-L most similar entities, and compute the similarities only for them. Such an
approach does not require any preprocessing and could be feasible at run time. Its
feasibility also depends on how exactly the similarity function is defined. Below
we will elaborate on such an approach. The presented approach can be applied
to our similarity metric, as well as to other similarity metrics whose computation
requires analyzing the subgraphs of the compared entities. The ultimate objective
is to devise efficient top-k algorithms (in the spirit of [8,9]) appropriate for
graph-based similarity measures. Nevertheless, such a method cannot be faster than
the preprocessing method. On the other hand, the benefit of adopting such a method
is that it does not require having access to (or the ability to store) the entire
KB. We should note that [12] also proposes to query the Web of Linked Data by
traversing RDF links at run time, since due to the openness of the LOD space it
may not be possible to know in advance all data sources that might be relevant for
query answering. We should stress at this point that our problem is more difficult,
since we do not want to evaluate a single SPARQL query but to find the most similar
entities, and this in general requires the evaluation of several queries.

7.1 On Reversing the Similarity Function 

Consider an entity A and suppose that we want to compute the entities most
similar to A. This requires computing the subgraph of A as well as the subgraphs
of the other entities of the KB. Below we will study this problem by considering
one kind of graph expansion at a time.

• ResFrom(·)-graph expansion.

Suppose the graph expansion is defined only by ResFrom(·). It is not hard to see
that for each $x \in ResTo(ResFrom(A))$ it holds:
$ResFrom(A) \cap ResFrom(x) \neq \emptyset$.

Let $X_{rf}(A) = ResTo(ResFrom(A))$. Moreover, if $x' \notin X_{rf}(A)$, then
$ResFrom(A) \cap ResFrom(x') = \emptyset$. This means that the numerator of the
similarity function is certainly greater than zero only for these entities.

• Classes(·)-graph expansion.

For this expansion method, it is not hard to see that for each
$x \in X_{cl}(A) = Instances(Classes(A))$ it holds
$Classes(A) \cap Classes(x) \neq \emptyset$.

• SuperClasses(·)-graph expansion.

Analogously, for each $x \in X_{sp}(A) = SubClasses(SuperClasses(A))$ it holds
$SuperClasses(A) \cap SuperClasses(x) \neq \emptyset$.

It follows from the above that all elements of
$X_{\cup}(A) = X_{rf}(A) \cup X_{cl}(A) \cup X_{sp}(A)$,
and only these elements, have certainly non-zero similarity.
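
To make the case k = 1 concrete, the sets $X_{rf}(A)$ and $X_{cl}(A)$ can be computed over a small in-memory set of triples as sketched below. The toy triples and all function names are hypothetical. We assume here that ResFrom collects the objects of the non-`rdf:type` triples of a resource, as the corresponding SPARQL queries suggest. The SuperClasses part is omitted for brevity, and A itself is removed from the result.

```python
RDF_TYPE = "rdf:type"

# Hypothetical toy KB as (subject, predicate, object) triples.
TRIPLES = {
    ("DaVinciCode", "starring", "TomHanks"),
    ("AngelsAndDemons", "starring", "TomHanks"),
    ("Jude", "starring", "KateWinslet"),
    ("DaVinciCode", RDF_TYPE, "Film"),
    ("AngelsAndDemons", RDF_TYPE, "Film"),
    ("Jude", RDF_TYPE, "Film"),
    ("TomHanks", RDF_TYPE, "Actor"),
}


def res_from(a):
    """Objects that a points to through properties other than rdf:type."""
    return {o for s, p, o in TRIPLES if s == a and p != RDF_TYPE}


def res_to(x):
    """Subjects that point to x through properties other than rdf:type."""
    return {s for s, p, o in TRIPLES if o == x and p != RDF_TYPE}


def classes(a):
    return {o for s, p, o in TRIPLES if s == a and p == RDF_TYPE}


def instances(c):
    return {s for s, p, o in TRIPLES if o == c and p == RDF_TYPE}


def candidates_k1(a):
    """Return (X_rf ∪ X_cl, X_rf, X_cl) for a, each without a itself."""
    x_rf = {y for x in res_from(a) for y in res_to(x)} - {a}
    x_cl = {y for c in classes(a) for y in instances(c)} - {a}
    return x_rf | x_cl, x_rf, x_cl


union, x_rf, x_cl = candidates_k1("DaVinciCode")
```

In this toy KB only the other film starring the same actor shares a ResFrom target with the seed entity, while both other films share its class.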

Let us now discuss the case where k > 1. In general a value of k greater than
one specifies a set of expansion paths. We can follow these expansion paths to get
the nodes of the subgraph of A, and then "reverse" the expansion paths and apply
them to the ending nodes of the subgraph of A. This should be done with care, since
although a path can have length 3 (i.e. k = 3), an ending node of the subgraph
could be the result of an expansion of shorter length (e.g. of one), implying that
the reversed paths should be shorter too.

The application of these reversed paths can give us the candidate entities.
This is actually what we have described above for the case where k = 1. Below we
describe in detail this process for any value of k.

Consider the set of strings Directions = {ResFrom, ResTo, Classes, Instances,
SubClasses, SuperClasses}. A graph expansion step over RDF/S can be specified
by a subset of this set. For instance, the graph expansion used by the proposed
similarity metric is specified by the set {ResFrom, Classes, SubClasses}. We can
define the "reverse" of a direction as:

    Rev(ResFrom)    = ResTo
    Rev(Classes)    = Instances
    Rev(SubClasses) = SuperClasses

For a subset $S \subseteq Directions$, we define $Rev(S) = \bigcup_{s \in S} Rev(s)$.

The algorithm getCandidateSimilar (shown in Fig. 10) takes as input a node
A, the value k, and a policy being a subset of Directions. It returns those objects
which have high chances to be very similar to A (actually those whose similarity
with A is certainly positive), assuming $sim_k$ over subgraphs defined using the
directions in policy.
Algorithm getCandidateSimilar
Input: A, k, policy
Output: A set of resources

(1) R = ∅;
(2) compute g_k(A) = (N_k(A), E_k(A)) w.r.t. policy
(3) For each n ∈ N_k(A)
(4)     let d = dist(n, A)
(5)     R = R ∪ traverse(Rev(policy), n, d)
(6) End for
(7) return R;

Fig. 10. Alg. for getting the resources which have "similar" subgraphs to A using
sim_k

At line (2) the algorithm computes the subgraph of A according to the directions
set in policy. The distance at line (4) has been computed during line (2).
The invocation traverse(dirs, n, d) starts from n and follows the links that
correspond to the argument dirs, for up to distance d, and returns the encountered
nodes. To make it clearer, the set of nodes $N_k(A)$ (at line 2) can be computed
by $N_k(A)$ = traverse(policy, A, k). Regarding the correctness of the algorithm, as
explained earlier, only the elements in the returned R can have non-zero similarity
to A. After having run the algorithm, the next step is to compute $sim_k(A, r)$ for
each $r \in R$ and return the most similar elements. Specifically, for each $r \in R$
we should get all information returned by traverse(policy, r, k). With this
information we can compute $sim_k(A, r)$. This can be done either by code or with
queries. For instance, $sim_1(A, B)$, assuming that the subgraphs of A and B are
defined only by Classes(·), can be computed with a query of the form:

SELECT ((COUNT(DISTINCT ?class1) / COUNT(DISTINCT ?class2)) AS ?res)
WHERE {
  # ?class1 ranges over the classes shared by A and B (intersection)
  { A rdf:type ?class1 .
    B rdf:type ?class1 . }
  UNION
  # ?class2 ranges over the classes of A or of B (union)
  { { A rdf:type ?class2 . }
    UNION
    { B rdf:type ?class2 . } }
}

The above query can be extended to capture also the remaining graph expansion
steps. However, the case where k > 1 requires the formulation of much more
complex queries. It is easier to do the required computation with a programming
language than with a query language.
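
As an illustration of doing this computation in code, the sketch below implements the steps of Fig. 10 over an in-memory list of triples. It is a minimal sketch under several assumptions that are ours and not part of the algorithm's specification: the `REV` table is extended to all six directions, `traverse` returns the encountered nodes without the start node, and A itself is dropped from the result. The toy triples and the helper `step` are hypothetical.

```python
from collections import deque

RDF_TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"

# Assumed reversal table, extended to all six directions.
REV = {"ResFrom": "ResTo", "ResTo": "ResFrom",
       "Classes": "Instances", "Instances": "Classes",
       "SubClasses": "SuperClasses", "SuperClasses": "SubClasses"}
FORWARD = {"ResFrom", "Classes", "SuperClasses"}  # subject -> object
PRED = {"Classes": RDF_TYPE, "Instances": RDF_TYPE,
        "SuperClasses": SUBCLASS, "SubClasses": SUBCLASS}


def step(triples, node, direction):
    """Nodes one hop away from node along a single direction."""
    wanted = PRED.get(direction)  # None: any non-schema property
    out = set()
    for s, p, o in triples:
        ok = (p == wanted) if wanted else (p not in (RDF_TYPE, SUBCLASS))
        if ok and direction in FORWARD and s == node:
            out.add(o)
        elif ok and direction not in FORWARD and o == node:
            out.add(s)
    return out


def traverse(triples, dirs, start, d):
    """Breadth-first expansion; maps each encountered node to its distance."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        n = queue.popleft()
        if dist[n] == d:
            continue  # do not expand beyond distance d
        for direction in dirs:
            for m in step(triples, n, direction):
                if m not in dist:
                    dist[m] = dist[n] + 1
                    queue.append(m)
    del dist[start]
    return dist


def get_candidate_similar(triples, a, k, policy):
    r = set()
    reach = traverse(triples, policy, a, k)       # line (2): N_k(A) with distances
    rev = {REV[s] for s in policy}                # Rev(policy)
    for n, d in reach.items():                    # lines (3)-(5)
        r |= set(traverse(triples, rev, n, d))
    r.discard(a)
    return r


# Hypothetical toy KB.
TRIPLES = [
    ("DaVinciCode", "starring", "TomHanks"),
    ("AngelsAndDemons", "starring", "TomHanks"),
    ("Jude", "starring", "KateWinslet"),
    ("DaVinciCode", RDF_TYPE, "Film"),
    ("AngelsAndDemons", RDF_TYPE, "Film"),
    ("Jude", RDF_TYPE, "Film"),
    ("TomHanks", RDF_TYPE, "Actor"),
]

near = get_candidate_similar(TRIPLES, "DaVinciCode", 1, {"ResFrom", "Classes"})
far = get_candidate_similar(TRIPLES, "DaVinciCode", 2, {"ResFrom", "Classes"})
```

With k = 2 the reversed paths also pass through intermediate nodes, so the candidate set may contain resources of a different type than A; the exact similarities are computed afterwards only for the returned candidates.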



To be more precise, the division has to be cast using an XSD datatype.
We have just seen how we can collect only those elements with positive similarity
to A, by first getting the subgraph of A, and then reversing the expansion
paths that defined the subgraph of A.

[Top-L Algorithm] 

The above algorithm can be extended to become a top-L algorithm, in case we
are interested in finding only the L most similar entities. Let us start from the
case where k = 1 and suppose that the cardinality of the set $X_{\cup}(A)$ is high.
Since we are interested in finding the L entities most similar to A, we can adopt a
different, more efficient evaluation approach; specifically, we can avoid collecting
all elements that would be fetched at line (5) of the algorithm getCandidateSimilar.
The idea is to collect at first those elements in
$X_{\cap}(A) = X_{rf}(A) \cap X_{cl}(A) \cap X_{sp}(A)$. Clearly,
the elements in $X_{\cap}(A)$ will have a positive summand for each part of the
similarity function, and thus have high probability to contain the L most similar
entities. If they are more than the desired number of objects L, i.e. if
$|X_{\cap}(A)| > L$, then we can rank them and present the L most similar entities.
The benefit of this method, in comparison to collecting the elements of the entire
$X_{\cup}(A)$ (i.e. line (5)), is that the elements of $X_{\cap}(A)$, apart from
being fewer, can be fetched efficiently, specifically with one query.

For instance, the set $X_{rf}(A)$ can be computed by the following SPARQL query:

SELECT ?y
WHERE { A ?p1 ?x .
        ?y ?p2 ?x .
        FILTER ( ?p1 != rdf:type &&
                 ?p2 != rdf:type ) }

Note that if we wanted to use ResFrom' instead of ResFrom, then we would
have to use the query:

SELECT ?y
WHERE { A ?p ?x .
        ?y ?p ?x .
        FILTER ( ?p != rdf:type ) }

The set $X_{cl}(A)$ can be computed by the following SPARQL query:

SELECT ?y
WHERE { A rdf:type ?x .
        ?y rdf:type ?x . }

The set $X_{sp}(A)$ can be computed by the following SPARQL query:

SELECT ?y
WHERE { A rdfs:subClassOf ?x .
        ?y rdfs:subClassOf ?x . }

Now $X_{rf}(A) \cap X_{cl}(A) \cap X_{sp}(A)$ can be computed by the following
SPARQL query:

SELECT ?y
WHERE { A ?p1 ?x .
        ?y ?p2 ?x .
        A rdf:type ?z .
        ?y rdf:type ?z .
        A rdfs:subClassOf ?w .
        ?y rdfs:subClassOf ?w .
        FILTER ( ?p1 != rdf:type &&
                 ?p2 != rdf:type ) }

Note that the above query can give a non-empty result only if A is at class
level and thus can have superclasses.

If however the fetched elements are fewer than L, i.e. if $|X_{\cap}(A)| < L$, then
we have to fetch more elements. We can start collecting those elements that belong
to intersections of two of the above sets, i.e. the elements in
$X_{rf}(A) \cap X_{cl}(A)$, $X_{rf}(A) \cap X_{sp}(A)$, and $X_{cl}(A) \cap X_{sp}(A)$.

For example, $X_{rf}(A) \cap X_{cl}(A)$ can be computed by the following SPARQL
query:

SELECT ?y
WHERE { A ?p1 ?x .
        ?y ?p2 ?x .
        A rdf:type ?z .
        ?y rdf:type ?z .
        FILTER ( ?p1 != rdf:type &&
                 ?p2 != rdf:type ) }

If again the fetched elements are fewer than L, then we can collect those in
$X_{rf}(A) \cup X_{cl}(A) \cup X_{sp}(A)$, i.e. run the original line (5). These
elements can be fetched using the following query:

SELECT ?y
WHERE {
  { A ?p1 ?x .
    ?y ?p2 ?x .
    FILTER ( ?p1 != rdf:type &&
             ?p2 != rdf:type ) }
  UNION
  { A rdf:type ?z .
    ?y rdf:type ?z . }
  UNION
  { A rdfs:subClassOf ?w .
    ?y rdfs:subClassOf ?w . }
}

Essentially the main idea is the following. If the subgraph is defined by a set
of directions dirs, then instead of reversing each direction in isolation and
getting the union, try reversing all directions at once, then all directions except
one, and so on. In other words, it is like starting from the top node of the Hasse
diagram of the powerset of dirs, $(\mathcal{P}(dirs), \subseteq)$, and then
descending level-wise. E.g.:

                {P,C,S}               : level 1
               /   |   \
        {P,C}    {P,S}    {C,S}       : level 2
          |  \   /   \   /  |
          |   \ /     \ /   |
          |   / \     / \   |
          |  /   \   /   \  |
         {P}      {C}      {S}        : level 3
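
The level-wise descent can be sketched as follows. For illustration only, the per-direction candidate sets are given as precomputed Python sets; in a LOD setting each intersection would instead be fetched with its own query, so that the larger sets are never materialized. The function name `collect_by_levels`, the labels P, C, S and the toy sets are assumptions of this example.

```python
from itertools import combinations


def collect_by_levels(cand_sets, L):
    """Descend the powerset of directions level by level.

    cand_sets maps a direction label to its candidate set, e.g.
    {"P": X_rf, "C": X_cl, "S": X_sp}. Returns the collected candidates and
    the level at which collection stopped (level 1 = all directions at once).
    """
    labels = sorted(cand_sets)
    collected = set()
    for size in range(len(labels), 0, -1):
        for combo in combinations(labels, size):
            # Elements that are candidates w.r.t. every direction in combo.
            collected |= set.intersection(*(cand_sets[d] for d in combo))
        if len(collected) >= L:
            return collected, len(labels) - size + 1
    return collected, len(labels)


# Hypothetical candidate sets for some entity A.
sets_for_a = {
    "P": {"a", "b", "c"},
    "C": {"b", "c", "d", "e"},
    "S": {"c"},
}

enough_at_top = collect_by_levels(sets_for_a, L=1)
need_pairs = collect_by_levels(sets_for_a, L=2)
need_all = collect_by_levels(sets_for_a, L=5)
```

The collected elements still have to be ranked with the similarity function; the descent only bounds how many candidates are fetched before ranking.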



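|Answer(q)| for each query q:

A                  | X_rf∪cl(A) | X_rf(A) | X_cl(A) | X_rf(A) ∩ X_cl(A)
-------------------|------------|---------|---------|------------------
DaVinciCode        | 82         | 76      | 81      | 75
Tom Hanks          | 185        | 5       | 182     | 2
That Thing You Do! | 82         | 75      | 81      | 74

Fig. 11. Measurements over the local KB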



|Answer(q)| for each query q:

A           | X_rf∪cl(A) | X_rf(A) | X_cl(A)   | X_rf(A) ∩ X_cl(A)
------------|------------|---------|-----------|------------------
Americano   | 1,679,605  | 32,318  | 1,679,318 | 32,031
DaVinciCode | 1,683,729  | 246,918 | 1,668,503 | 231,692
Illuminati  | 1,676,081  | 98,032  | 1,668,503 | 90,454
Tom Hanks   | 2,218,574  | 862,458 | 2,183,320 | 827,204

Fig. 12. Measurements over DBPEDIA



In Fig. 11 and Fig. 12 we report the number of returned resources, for various
entities and for various queries, including the query that returns the union of
$X_{rf}(A)$ and $X_{cl}(A)$, denoted by $X_{rf \cup cl}(A)$, defined as:

SELECT ?y
WHERE {
  { A ?p1 ?x .
    ?y ?p2 ?x .
    FILTER ( ?p1 != rdf:type &&
             ?p2 != rdf:type ) }
  UNION
  { A rdf:type ?z .
    ?y rdf:type ?z . }
}
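
As a sanity check on the reported measurements, the counts can be tested against the inclusion-exclusion identity $|X_{rf} \cup X_{cl}| = |X_{rf}| + |X_{cl}| - |X_{rf} \cap X_{cl}|$. The short sketch below does this, assuming that the four values reported per entity are, in order, the union, $X_{rf}$, $X_{cl}$ and the intersection; the names `REPORTED`, `is_consistent` and `shrink_factor` are ours.

```python
# (|X_rf∪cl|, |X_rf|, |X_cl|, |X_rf ∩ X_cl|) per entity, copied from Fig. 11 and Fig. 12.
REPORTED = {
    ("local KB", "DaVinciCode"): (82, 76, 81, 75),
    ("local KB", "Tom Hanks"): (185, 5, 182, 2),
    ("local KB", "That Thing You Do!"): (82, 75, 81, 74),
    ("DBpedia", "Americano"): (1_679_605, 32_318, 1_679_318, 32_031),
    ("DBpedia", "DaVinciCode"): (1_683_729, 246_918, 1_668_503, 231_692),
    ("DBpedia", "Illuminati"): (1_676_081, 98_032, 1_668_503, 90_454),
    ("DBpedia", "Tom Hanks"): (2_218_574, 862_458, 2_183_320, 827_204),
}


def is_consistent(union, rf, cl, inter):
    """Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|."""
    return union == rf + cl - inter


def shrink_factor(union, rf, cl, inter):
    """How many times smaller the intersection answer is than the union answer."""
    return union / inter


checks = {key: is_consistent(*row) for key, row in REPORTED.items()}
factors = {key: round(shrink_factor(*row), 1) for key, row in REPORTED.items()}
```

All seven rows satisfy the identity, and `factors` quantifies how much each intersection query shrinks the answer relative to the union query.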

We did not manage to obtain reliable results for the above queries over the
LinkedMDB SPARQL endpoint, since for some reason it does not return very big
answers. Therefore in Fig. 11 we report some indicative (and quite predictable)
results over the local KB. Even in this toy KB we can see how the returned
resources are reduced while the required time does not increase a lot.

To get more realistic results, we tried the SPARQL endpoint of DBPEDIA
(http://dbpedia.org/sparql). If A is the movie Americano
(http://dbpedia.org/resource/The_Americano), then $|X_{cl}(A)|$ = 1,679,318 (i.e.
all films), $|X_{rf}(A) \cap X_{cl}(A)|$ = 32,031, $|X_{rf}(A)|$ = 32,318, and
$|X_{rf \cup cl}(A)|$ = 1,679,605 (a further count of 32,094 is also reported for
this entity).

Measurements for other entities are shown in Fig. 12. We observe some big
reductions in the answer set (from millions to tens of thousands). However, even
for the intersection query the returned answer is quite big; 32 thousand hits,
although much fewer than millions, are probably too many for fast real-time
interaction. One approach to tackle this problem is to formulate even more
restrictive queries which capture the desired characteristics of the similarity
function more accurately. The extra condition(s) can be added to the query as extra
graph patterns, or the query can be enriched with an appropriate ORDER BY clause. In
the latter case the application can consume only the top hits of the ranked answer.

The general approach would be to enrich the query with aggregated counts or
similarity functions, aiming at a query that directly returns the top-L similar
entities in ranked order. However, this is not always possible (it depends on how
the similarity metric is defined), and in some cases this approach is expected to be
less efficient than fetching the needed information through queries and then
ranking the entities using programming language code. Of course the availability of
LOD SPARQL endpoints which support extended versions of SPARQL would be
useful. For instance, [17] investigates methods to integrate customized similarity
functions into SPARQL. Among the proposed techniques, it seems that the so-called
virtual triple approach would be beneficial (shorter queries which are easier
to write, optimization potential). However, the scenarios described there are
simpler, in the sense that only the direct neighborhood of the compared entities is
taken into account, and similarity thresholds should be adopted (instead of a
parameter L). This direction should be further researched. In general, there is a
need for semantic query optimization techniques for similarity queries.

Another important point, which is independent of the query language, is that
the refinement of the information that is available in the LOD cloud, i.e. the
classification of the available resources to more refined classes, is expected not
only to improve the quality of the computed similarities, but also to make the
computation of the similar entities more efficient. Specifically, if entity A were
not classified only as film, but to more refined classes (e.g. Thriller, Anti-war
Film etc.), then $|X_{cl}(A)|$ would be smaller.

Above we have sketched a top-L version of the algorithm and identified evaluation
approaches and difficulties, for the case k = 1. If k is greater than one,
then one approach is to start from k' = 1 and apply the above algorithm. If
the fetched elements are fewer than L, then move to k' = 2, and so on, until having
fetched L elements or reached the original value of k (i.e. until k' = k). However,
as we saw in the example of Fig. 8, such an approach does not guarantee
that the top-L similar entities with respect to $sim_1$ are the same as the top-L
similar entities with respect to $sim_k$ (nevertheless this approach could be used
as an approximation).

Probably the best feasible solution, for the time being, is to define, store and
periodically update materialized views, accessible through LOD endpoints, which
for each entity contain the set of most similar entities.

8 Conclusion 

In this paper, we motivated the need for similarity-based browsing over entities
which are semantically defined. This kind of browsing can be applied to various
kinds of entities, e.g. movies, paintings, photographs, videos, restaurants, or even
social entities (groups or individual persons). We introduced a similarity metric
which is type-independent, meaning that it can find similarities between entities of
different types (for example similarities between an actor and a movie), which is
very convenient for similarity-based browsing. The way the similarity metric
functions is somewhat similar to the spreading activation retrieval method proposed
for semantic networks [B]. The metric can also be configured (the radius k as well as
the graph expansion policy) according to the characteristics of the corpus at hand
(and the "affordable" computational complexity). We demonstrated the behavior
and the benefits of this metric over a LOD-based application offering
similarity-based browsing for movie information. We believe that this metric can
also be useful in semantic search [TU]. We do not argue that the graph expansion
method adopted by the similarity function is the best for all occasions. Instead we
have the impression that in many cases the selection of the graph expansion method
should be application specific.

Finally, we discussed implementation approaches and we elaborated on a method 
which is "harmonized" with the distributed and open nature of LOD. The de- 
scribed method can be used for computing the L most similar entities according 
to similarity metrics which are neighborhood-based. Specifically we showed how 
a neighborhood-based similarity metric can be reversed to get a query which can 
collect only those entities whose similarity is certainly greater than 0. Furthermore 
we sketched possible top-L extensions of the algorithm. 

Below we discuss some directions which in our opinion are worth
further research. Regarding similarity functions, there is a need for test
collections appropriate for comparative evaluation. Regarding algorithms, it is
worth investigating top-L (or nearest-K) algorithms appropriate for the LOD domain.
Regarding services for end users, a next step is to devise methods for clustering
the set of similar entities. Finally, as in web searching, log analysis can be
exploited for improving the computation of similarities at the application layer.

Moreover, we would like to note that as the number of sources increases, the
need for ontology matching techniques (and lexical similarity functions) increases
as well. In our application, since we used only two sources of information, we did
not face this problem. In any case, the approach presented in this paper can be
applied after applying entity matching approaches. A related issue is the management
of the sameAs predicate. In brief, if two entities are related with such
relationships, then they should be treated as equal by the similarity function.
Another direction is to consider weighted triples, e.g. investigate a representation
framework like Fuzzy RDF [35], and investigate similarity functions for such KBs (an
extension of the faceted browsing for such sources is described at [3T]).